[Day24] Using an LLM to Automatically Generate a Fine-tuning Dataset

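2024 iThome Ironman Contest

DAY 24

Generative AI

Exploring GenAI Practice in the Enterprise from a System Design Perspective, Part 24 of the series

16th Ironman Contest

Pei

Team: SI夢想工程隊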

2024-09-25 20:01:58

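The following is written with reference to the course LLM Twin: Building Your Production-Ready AI Replica.

What is LLM fine-tuning?

LLM fine-tuning is a technique for further training a large language model (such as GPT-3.5 or Mistral-7B) on a small dataset to improve its performance on a specific task. A pretrained model has learned broad general knowledge; fine-tuning focuses it on one domain so that it gives more precise responses on specialized tasks.

Put simply, fine-tuning turns a general-purpose model into a specialist for a particular domain, so that it produces more targeted results in that domain's scenarios.

Why fine-tuning matters
Although a pretrained model has rich general knowledge, it usually does not reach its best performance in specialized domains. For example, a model can answer common knowledge questions, but when a question involves specific legal clauses or medical knowledge, its answers may not be precise enough. Fine-tuning lets us adjust the model further to improve its accuracy within that domain.

Step 1: Prepare the fine-tuning dataset

Before fine-tuning, you need to prepare a dedicated training dataset. The dataset consists of instruction-content pairs, which teach the model to generate a response suited to a given instruction.

Why use an instruction-content dataset?
An instruction-content dataset helps the model understand a specific task. For example, if you want the model to write LinkedIn posts, you can supply existing posts as the content and generate a matching instruction for each one. Trained on many such pairs, the model handles similar requests more reliably.

Below is an example of the source data. The records are pulled from Qdrant, and OpenAI's GPT-3.5-turbo is then used to generate the instruction-content pairs automatically: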

[
  {
    "author_id": "2",
    "cleaned_content": "Do you want to learn to build hands-on LLM systems using good LLMOps practices? A new Medium series is coming up...",
    "platform": "linkedin",
    "type": "posts"
  },
  {
    "author_id": "2",
    "cleaned_content": "RAG systems are far from perfect. This free course teaches you how to improve your RAG system...",
    "platform": "linkedin",
    "type": "posts"
  }
]

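Step 2: Use the `DataFormatter` class

To generate the training data automatically, the course uses a DataFormatter class that formats the data into a prompt the LLM can understand and asks it to produce a matching instruction.

`DataFormatter`: formatting data into prompt content the model can accept

DataFormatter is a helper that organizes the data into an LLM-readable format when preparing data for fine-tuning. This saves the time of arranging the data by hand and keeps the structure consistent.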

data_type = "posts"
# Base prompt shared by every batch; format_prompt appends the per-batch part.
USER_PROMPT = (
    f"I will give you batches of contents of {data_type}. Please generate me exactly 1 instruction for each of them. The {data_type} text "
    f"for which you have to generate the instructions is under Content number x lines. Please structure the answer in json format, "
    f"ready to be loaded by json.loads(), a list of objects only with fields called instruction and content. For the content field, copy the number of the content only! "
    f"Please do not add any extra characters and make sure it is a list with objects in valid json format!\n"
)

class DataFormatter:
    @classmethod
    def format_data(cls, data_points: list, is_example: bool, start_index: int) -> str:
        """Join data points into one string, numbering each one unless it is an example."""
        text = ""
        for index, data_point in enumerate(data_points):
            if not is_example:
                text += f"Content number {start_index + index}\n"
            text += str(data_point) + "\n"
        return text

    @classmethod
    def format_batch(cls, context_msg: str, data_points: list, start_index: int) -> str:
        """Prefix a numbered batch with a section header such as CONTENTS FOR GENERATION."""
        return context_msg + cls.format_data(data_points, False, start_index)

    @classmethod
    def format_prompt(cls, inference_posts: list, start_index: int) -> str:
        """Build the full prompt for one batch of posts."""
        initial_prompt = USER_PROMPT
        initial_prompt += f"You must generate exactly a list of {len(inference_posts)} json objects, using the contents provided under CONTENTS FOR GENERATION\n"
        initial_prompt += cls.format_batch(
            "\nCONTENTS FOR GENERATION: \n", inference_posts, start_index
        )
        return initial_prompt
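
The prompt asks the model for output that is "ready to be loaded by json.loads()", but a model does not always comply, so it can help to check the reply before using it. The following is a minimal sketch of such a check, not part of the course code: the function name `parse_instruction_response` and the decision to tolerate a Markdown code fence around the JSON are assumptions made for illustration.

```python
import json


def parse_instruction_response(raw: str, expected_count: int) -> list[dict]:
    """Parse the model's reply and check it has the shape the prompt asked for."""
    fence = "`" * 3  # Markdown code fence marker
    text = raw.strip()
    # Assumption: the model may wrap the JSON in a code fence; strip it if so.
    if text.startswith(fence):
        text = text.strip("`")
        # Drop an optional language tag such as "json" after the opening fence.
        if text.lower().startswith("json"):
            text = text[4:]
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list")
    if len(items) != expected_count:
        raise ValueError(f"expected {expected_count} objects, got {len(items)}")
    for item in items:
        # Each object must carry exactly the two fields named in the prompt.
        if not isinstance(item, dict) or set(item) != {"instruction", "content"}:
            raise ValueError(f"unexpected object: {item!r}")
    return items
```

Calling `parse_instruction_response(reply, expected_count=len(batch))` right after the API call would surface a malformed reply at the batch that caused it, rather than later when the pairs are assembled.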

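Step 3: Automate dataset generation with the `DatasetGenerator` class

The course uses a DatasetGenerator class to automate the generation of training data. The class fetches data from the database, formats it, and pushes the result to a platform such as Comet ML, where it can be used for fine-tuning.

`DatasetGenerator`: automating training data generation

DatasetGenerator ties together file handling, API communication, and data formatting, which reduces the need for manual work.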

class DatasetGenerator:
    def __init__(self, file_handler, api_communicator, data_formatter):
        self.file_handler = file_handler
        self.api_communicator = api_communicator
        self.data_formatter = data_formatter

    def generate_training_data(self, collection_name: str, batch_size: int = 1):
        all_contents = self.fetch_all_cleaned_content(collection_name)
        response = []
        for i in range(0, len(all_contents), batch_size):
            batch = all_contents[i : i + batch_size]
            initial_prompt = self.data_formatter.format_prompt(batch, i)
            batch_response = self.api_communicator.send_prompt(initial_prompt)
            # Replace the numeric placeholder in each reply with the original text.
            # zip() stops at the shorter list, so a short final batch or a short
            # reply cannot index past the end.
            for item, content in zip(batch_response, batch):
                item["content"] = content
            response += batch_response

        self.push_to_comet(response, collection_name)
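
To make the batching and re-attachment step easier to follow in isolation, here is a small standalone sketch that runs without any API key. It is not the course implementation: `generate_pairs` and `fake_send_prompt` are hypothetical names, and the fake sender simply echoes back one object per numbered content so the flow can be traced end to end.

```python
from typing import Callable


def generate_pairs(
    contents: list[str],
    send_prompt: Callable[[str], list[dict]],
    batch_size: int = 2,
) -> list[dict]:
    """Batch contents, ask the model for instructions, and re-attach the text."""
    pairs = []
    for start in range(0, len(contents), batch_size):
        batch = contents[start : start + batch_size]
        # Number each content so the reply can refer to it by index.
        prompt = "".join(
            f"Content number {start + offset}\n{text}\n"
            for offset, text in enumerate(batch)
        )
        reply = send_prompt(prompt)
        if len(reply) != len(batch):
            raise ValueError(
                f"batch at {start}: expected {len(batch)} items, got {len(reply)}"
            )
        # Swap the numeric placeholder for the original text.
        for item, text in zip(reply, batch):
            pairs.append({"instruction": item["instruction"], "content": text})
    return pairs


def fake_send_prompt(prompt: str) -> list[dict]:
    """Hypothetical stand-in for the LLM call: one object per numbered content."""
    numbers = [
        int(line.removeprefix("Content number "))
        for line in prompt.splitlines()
        if line.startswith("Content number ")
    ]
    return [{"instruction": f"Write post {n}", "content": n} for n in numbers]


pairs = generate_pairs(["post A", "post B", "post C"], fake_send_prompt, batch_size=2)
print(pairs[-1])  # {'instruction': 'Write post 2', 'content': 'post C'}
```

With three contents and a batch size of 2, the last batch holds a single item. Pairing replies with their batch through `zip` keeps that case from indexing out of range.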

Fetching the content
The fetch_all_cleaned_content method pulls all cleaned content out of Qdrant for later processing:

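# Method of DatasetGenerator; `client` is assumed to be a QdrantClient instance created elsewhere.
def fetch_all_cleaned_content(self, collection_name: str) -> list:
    all_cleaned_contents = []
    # scroll() returns (points, next_page_offset); only the first page of up to 10,000 points is read here.
    scroll_response = client.scroll(collection_name=collection_name, limit=10000)
    points = scroll_response[0]
    for point in points:
        # Skip points whose payload has no cleaned_content or an empty one.
        cleaned_content = (point.payload or {}).get("cleaned_content")
        if cleaned_content:
            all_cleaned_contents.append(cleaned_content)
    return all_cleaned_contents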

This method uses Qdrant's scroll API to fetch the data in bulk, ready for generating the instruction-content pairs.
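
A single scroll call with a fixed limit returns at most that many points. For larger collections, one option is to follow the next-page offset that scroll returns alongside the points. The sketch below is an assumption-laden variant rather than the course code: it takes the client as a parameter, and `page_size` is a made-up default. It relies only on `scroll` accepting `collection_name`, `limit`, `offset`, and `with_payload`, and returning a `(points, next_offset)` tuple.

```python
def fetch_all_cleaned_content(client, collection_name: str, page_size: int = 256) -> list[str]:
    """Collect every non-empty `cleaned_content` payload, one page at a time."""
    contents = []
    offset = None  # None asks for the first page
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=page_size,
            offset=offset,
            with_payload=True,
        )
        for point in points:
            # A point may have no payload at all, so fall back to an empty dict.
            cleaned = (point.payload or {}).get("cleaned_content")
            if cleaned:
                contents.append(cleaned)
        if offset is None:  # no further pages
            return contents
```

Paging this way keeps each request small and removes the hard ceiling of a single fixed limit.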

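Final output

The process above yields a training dataset generated by an LLM, which can serve as the training set for fine-tuning a model. Here is an example of the generated output: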

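[
  {
    "instruction": "Write a LinkedIn post promoting an upcoming hands-on course on building LLM systems, highlighting LLMOps best practices and encouraging readers to follow for more course content.",
    "content": "Do you want to learn to build hands-on LLM systems using good LLMOps practices?..."
  },
  {
    "instruction": "Write a LinkedIn post promoting a free course on improving RAG systems, emphasizing the latest techniques such as query expansion and embedding adaptation.",
    "content": "RAG systems are far from perfect. This free course teaches you how to improve your RAG system..."
  }
]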

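With a dataset like this, the fine-tuned model can be expected to perform more accurately on tasks such as writing LinkedIn posts and to handle similar tasks better.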
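
Before the pairs are used for training, it is common to hold out a small portion for evaluation. The sketch below shows one way to do that with the standard library only. It is an illustration, not the course's method for pushing data to Comet ML: the function name `split_and_save`, the 10% test ratio, the seed, and the file names are all assumptions.

```python
import json
import random
from pathlib import Path


def split_and_save(
    pairs: list[dict], out_dir: Path, test_ratio: float = 0.1, seed: int = 42
) -> tuple[Path, Path]:
    """Shuffle the pairs, hold out a test split, and write both as JSON files."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    # Keep at least one test example when there is more than one pair.
    n_test = max(1, int(len(shuffled) * test_ratio)) if len(shuffled) > 1 else 0
    splits = {"test": shuffled[:n_test], "train": shuffled[n_test:]}
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name, rows in splits.items():
        paths[name] = out_dir / f"{name}.json"
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        paths[name].write_text(
            json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8"
        )
    return paths["train"], paths["test"]
```

For example, `split_and_save(pairs, Path("dataset"))` would write `dataset/train.json` and `dataset/test.json`.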
